Abstract

In this paper we study the adaptive prefix coding problem in cases
where the size of the input alphabet is large. We present an online prefix
coding algorithm that uses $O(\sigma^{1/\lambda+\epsilon})$ bits of space for any constants $\epsilon > 0$,
$\lambda > 1$, and encodes the string of symbols in $O(\log\log\sigma)$ time per symbol
in the worst case, where $\sigma$ is the size of the alphabet. The upper bound
on the encoding length is $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits.

1 Introduction 

In this paper we present an algorithm for adaptive prefix coding that uses
sub-linear space in the size of the alphabet. Space usage can be an important issue in
situations where the available memory is small; e.g., in mobile computing, when 
the alphabet is very large, and when we want the data used by the algorithm 
to fit into first-level cache memory. 

For instance, Version 5.0 of the Unicode Standard [T3] provides code points 
for 99 089 characters, covering "all the major languages written today". The 
Standard itself may be the only document to contain quite that many distinct 
characters, but there are over 50 000 Chinese characters, of which everyday 
Chinese uses several thousand [15]. One reason there are so many Chinese 
characters is that each conveys more information than an English character 
does; if we consider syllables, morphemes or words as basic units of text, then the 
English 'alphabet' is comparably large. Compressing strings over such alphabets 
can be awkward; the problem can be severely aggravated if we have only a small 
amount of (cache) memory at our disposal. 

Static and adaptive prefix encoding algorithms that use linear space in
the size of the alphabet have been extensively studied. The classical algorithm of
Huffman [5] enables us to construct an optimal prefix-free code and encode
a text in two passes in $O(n)$ time. Henceforth in this paper, $n$ denotes the
number of characters in the text, and $\sigma$ denotes the size of the alphabet;
$H(s) = \sum_{i=1}^{\sigma} \frac{f_{a_i}}{n} \log_2 \frac{n}{f_{a_i}}$ is the zeroth-order entropy$^1$ of $s$, where $f_a$ denotes
the number of occurrences of character $a$ in $s$. The length of the encoding is
$(H + d)n$ bits, and the redundancy $d$ can be estimated as $d \le p_{\max} + 0.086$, where
$p_{\max}$ is the probability of the most frequent character [5]. The drawback of
static Huffman coding is the need to make two passes over the data: we collect the
frequencies of different characters during the first pass, and then construct the
code and encode the string during the second pass. Adaptive coding avoids
this by maintaining a code for the prefix of the input string that has already
been read and encoded. When a new character $s_i$ is read, it is encoded with
the code for $s_1 \ldots s_{i-1}$; then the code is updated. The FGK algorithm [11] for
adaptive Huffman coding encodes the string in $(H + 2 + d)n + O(\sigma\log\sigma)$ bits,
while the adaptive Huffman algorithm of Vitter [16] guarantees that the string
is encoded in $(H + 1 + d)n + O(\sigma\log\sigma)$ bits. The adaptive Shannon coding
algorithms of Gagie [4] and Karpinski and Nekrich [10] encode the string in
$(H + 1)n + O(\sigma\log\sigma)$ bits and $(H + 1)n + O(\sigma\log^2\sigma)$ bits respectively. All
of the above algorithms use space at least linear in the size of the alphabet,
to count how often each distinct character occurs. All algorithms for adaptive
prefix coding, with the exception of [10], encode and decode in $\Theta(nH)$ time, i.e.,
the time to process the string depends on $H$ and hence on the size of the input
alphabet. The algorithm of [10] encodes a string in $O(n)$ time, and decoding
takes $\Theta(n\log H)$ time.
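The entropy definition above can be made concrete with a short computation. The following is a minimal sketch in Python; the function name `zeroth_order_entropy` is an illustrative choice, and the sum runs only over characters that actually occur in $s$ (the usual convention that absent characters contribute zero).

```python
from collections import Counter
from math import log2


def zeroth_order_entropy(s):
    """Return H(s) = sum_a (f_a / n) * log2(n / f_a), in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    counts = Counter(s)  # f_a for every character a that occurs in s
    return sum((f / n) * log2(n / f) for f in counts.values())
```

For a string over two equally frequent characters this gives one bit per character, and for a string made of a single repeated character it gives zero.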

Compression with sub-linear space usage was studied by Gagie and Manzini [5],
who proved the following lower bound: for any $g$ independent of $n$ and any
constants $\epsilon > 0$ and $\lambda > 1$, in the worst case we cannot encode $s$ in $\lambda H(s)n +
o(n\log\sigma) + g$ bits if, during the single pass in which we write the encoding, we
use $O(\sigma^{1/\lambda-\epsilon})$ bits of memory. In [5] the authors also presented an algorithm
that divides the input string into chunks of length $O(\sigma^{1/\lambda}\log\sigma)$ and encodes
each individual chunk with a modification of arithmetic coding, so that the
string is encoded with $(\lambda H(s) + \mu)n + O(\sigma^{1/\lambda}\log\sigma)$ bits. However, their
algorithm is quite complicated and uses arithmetic coding; hence, codewords are
not self-delimiting and the encoding is not 'instantaneously decodable'. Besides
that, their algorithm is based on static encoding of parts of the input string.

In this paper we present an adaptive prefix coding algorithm that uses
$O(\sigma^{1/\lambda+\epsilon})$ bits of memory and encodes a string $s$ with $\lambda n H(s) + (\lambda \ln 2 +
2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits. The encoding and decoding work in $O(\log\log\sigma)$
time per symbol in the worst case, and the whole string $s$ is encoded/decoded
in $O(n\log H(s))$ time. A randomized implementation of our algorithm uses
$O(\sigma^{1/\lambda}\log^2\sigma)$ bits of memory and works in $O(n\log H)$ expected time. Our
method is based on a simple but effective form of alphabet-partitioning (see,
e.g., p] and references therein) to trade off the size of a code and the
compression it achieves: we split the alphabet into frequent and infrequent characters;
we preface each occurrence of a frequent character with a 1, and each occurrence
of an infrequent one with a 0; we replace each occurrence of a frequent character
by a codeword, and replace each occurrence of an infrequent character by that
character's index in the alphabet.

$^1$ For ease of description, we sometimes simply denote the entropy by $H$ if the string $s$ is
clear from the context.

We make the natural assumption that unencoded files consist of characters
represented by their indices in the alphabet (cf. ASCII codes), so we can simply
copy the representation of an infrequent character from the original file. One
difficulty is that we cannot identify the frequent characters using a low-memory
one-pass algorithm: according to the lower bound of [9], any online algorithm
that identifies a set of characters $F$, such that each $a \in F$ occurs at least $\Theta n$
times for some parameter $\Theta$, needs $\Omega(\sigma\log\frac{n}{\sigma})$ bits of memory in the worst case.
We overcome this difficulty by maintaining the frequencies of symbols that occur
in a sliding window.

In Section 2 we review the data structures that are used by our algorithm. In
Section 3 we present a novel encoding method, henceforth called sliding-window
Shannon coding. Analysis of the sliding-window Shannon coding is given in
Section 4.

2 Preliminaries 

The dictionary data structure contains a set $S \subseteq U$, so that for any element
$x \in U$ we can determine whether $x$ belongs to $S$. We assume that $|S| = m$. The
following dictionary data structure is described in [7]:

Lemma 1 There exists an $O(m)$ space dictionary data structure that can be
constructed in $O(m\log m)$ time and supports membership queries in $O(1)$ time.

In the case of a polynomial-size universe, we can easily construct a data
structure that uses more space but also supports updates. The following lemma
is folklore.

Lemma 2 If $|U| = m^{O(1)}$, then there exists an $O(m^{1+\epsilon})$ space dictionary data
structure that can be constructed in $O(m^{1+\epsilon})$ time and supports membership
queries and updates in $O(1)$ time.

Proof: We regard $S$ as a set of binary strings of length $\log|U|$. All strings can be
stored in a trie $T$ with node degree $2^{\epsilon'\log|U|} = m^{\epsilon}$, where $\epsilon' = (\log m/\log|U|)\cdot\epsilon$.
The height of $T$ is $1/\epsilon' = O(1)$, and the total number of internal nodes is $O(m)$. Each
internal node uses $O(m^{\epsilon})$ space; hence, the data structure uses $O(m^{1+\epsilon})$ space
and can be constructed in $O(m^{1+\epsilon})$ time. Clearly, queries and updates are
supported in $O(1)$ time. □
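The trie in the proof of Lemma 2 can be sketched as follows. This is an illustrative Python version under stated assumptions, not the construction's exact layout: keys are integers in `range(universe)`, the class name `TrieDictionary` and the parameter `degree` (playing the role of $m^{\epsilon}$) are hypothetical, and deletion scans a node to see whether it became empty, which costs time proportional to the degree; a per-node child counter would avoid that scan.

```python
class TrieDictionary:
    """Fixed-height trie over integer keys in range(universe)."""

    def __init__(self, universe, degree):
        self.degree = degree
        self.height = 1
        while degree ** self.height < universe:  # smallest height covering U
            self.height += 1
        self.root = [None] * degree

    def _digits(self, x):
        """Base-`degree` digits of x, most significant first."""
        digits = []
        for _ in range(self.height):
            digits.append(x % self.degree)
            x //= self.degree
        return digits[::-1]

    def insert(self, x):
        node = self.root
        *inner, last = self._digits(x)
        for d in inner:
            if node[d] is None:
                node[d] = [None] * self.degree  # allocate an internal node
            node = node[d]
        node[last] = True

    def contains(self, x):
        node = self.root
        for d in self._digits(x):
            node = node[d]
            if node is None:
                return False
        return node is True

    def delete(self, x):
        path, node = [], self.root
        for d in self._digits(x):
            if node[d] is None:
                return  # x is not stored
            path.append((node, d))
            node = node[d]
        for parent, d in reversed(path):
            parent[d] = None
            if any(child is not None for child in parent):
                break  # parent still has other children; stop pruning
```

Because the height depends only on the ratio between the universe size and the degree, each lookup touches a constant number of nodes when the universe is polynomial in $m$.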

If we allow randomization, then a dynamic $O(m)$ space dictionary can be
maintained. We can use the result of [3]:

Lemma 3 There exists a randomized $O(m)$ space dictionary data structure that
supports membership queries in $O(1)$ time and updates in $O(1)$ expected time.

All of the above dictionary data structures can be augmented so that one or
more additional records are associated with each element of $S$; the record(s)
associated with element $b \in S$ can be accessed in $O(1)$ time.

In Section 3 we also use the following dynamic partial-sums data structure,
due to Moffat [12]:

Lemma 4 There is a dynamic searchable partial-sums data structure that stores
a sequence of $O(\log\sigma)$-bit real numbers $p_1, \ldots, p_k$ in $O(k\log\sigma)$ bits and supports
the following operations in $O(\log i)$ time:

• given an index $i$, return the $i$-th partial sum $p_1 + \cdots + p_i$;

• given a real number $b$, return the index $i$ of the largest partial sum
$p_1 + \cdots + p_i \le b$;

• given an index $i$ and a real number $d$, add $d$ to $p_i$.
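The three operations of Lemma 4 can be illustrated with a Fenwick (binary indexed) tree. This is a substitute technique shown only as a sketch: it takes $O(\log k)$ time per operation rather than the $O(\log i)$ of the structure cited in the lemma, the class and method names are hypothetical, and `search` assumes all stored values are non-negative.

```python
class PartialSums:
    """Fenwick tree over p_1..p_k (1-based indices), all initially zero."""

    def __init__(self, k):
        self.k = k
        self.tree = [0] * (k + 1)

    def add(self, i, d):
        """Add d to p_i."""
        while i <= self.k:
            self.tree[i] += d
            i += i & -i  # move to the next node covering index i

    def prefix_sum(self, i):
        """Return p_1 + ... + p_i (0 when i == 0)."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

    def search(self, b):
        """Return the largest i with p_1 + ... + p_i <= b (needs p_j >= 0)."""
        pos = 0
        step = 1 << self.k.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.k and self.tree[nxt] <= b:
                pos = nxt
                b -= self.tree[nxt]
            step >>= 1
        return pos
```

The values may be integers or `fractions.Fraction` objects; exact arithmetic is convenient when the entries are of the form $C[j]/2^j$.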

3 Adaptive coding 

The adaptive Shannon coding algorithm we present in this section combines
ideas from Karpinski and Nekrich's algorithm [10] with the sliding-window
approach, to encode $s$ in $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits using
$O(n\log H)$ time overall and $O(\log\log\sigma)$ time for any character, $O(\sigma^{1/\lambda+\epsilon})$ bits
of memory and one pass, for any given constants $\lambda > 1$ and $\epsilon > 0$. Whereas
Karpinski and Nekrich's algorithm considers the whole prefix already encoded,
our new algorithm encodes each character $s[i]$ of $s$ based only on the window
$w_i = s[\max(i - \ell, 1)..(i - 1)]$, where $\ell = \lceil c\sigma^{1/\lambda}\log\sigma \rceil$ and $c$ is a constant we
will define later in terms of $\lambda$ and $\epsilon$. (With $c = 10$, for example, we produce an
encoding of fewer than $\lambda n H(s) + (2\lambda + 2)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits; with $c = 100$,
the bound is $\lambda n H(s) + (0.9\lambda + 2)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits.) Let $f(a, s[i..j])$ denote
the number of occurrences of $a$ in $s[i..j]$. For $1 \le i \le n$, if $f(s[i], w_i) \ge \ell/\sigma^{1/\lambda}$,
then we write a 1 followed by $s[i]$'s codeword in our adaptive Shannon code;
otherwise, we write a 0 followed by $s[i]$'s $\lceil\log\sigma\rceil$-bit index in the alphabet.

As in the case of the quantized Shannon coding [10], our algorithm maintains
a canonical Shannon code. In a canonical code [13], [2], each codeword can be
characterized by its length and its position among codewords of the same length,
henceforth called offset. The codeword of length $j$ with offset $k$ can be computed
as the first $j$ bits of $\sum_{h=1}^{j-1} n_h/2^h + (k-1)/2^j$, where $n_h$ is the number of
codewords of length $h$.
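The codeword formula for a canonical code can be evaluated directly with exact rational arithmetic. The sketch below is illustrative: the function name `canonical_codeword` and the dictionary `n` (mapping a length $h$ to the number $n_h$ of codewords of that length) are assumptions made for the example.

```python
from fractions import Fraction


def canonical_codeword(j, k, n):
    """First j bits of sum_{h<j} n[h]/2^h + (k-1)/2^j, as a bit string.

    n maps a length h to the number of codewords of that length; k is the
    1-based offset among the codewords of length j.
    """
    value = sum(Fraction(n.get(h, 0), 2 ** h) for h in range(j))
    value += Fraction(k - 1, 2 ** j)
    assert 0 <= value < 1, "lengths must satisfy Kraft's inequality"
    numerator = int(value * 2 ** j)  # value is an exact multiple of 2^-j
    return format(numerator, "b").zfill(j) if j > 0 else ""
```

For a code with one codeword of length 1, one of length 2 and two of length 3, the four calls produce four bit strings, none of which is a prefix of another.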

We maintain four dynamic data structures: a queue $Q$, an augmented
dictionary $D$, an array $A[0..\lceil\log\sigma^{1/\lambda}\rceil,\ 0..\lfloor\sigma^{1/\lambda}\rfloor]$ and a searchable partial-sums
data structure $P$. (We actually use $A$ only while decoding but, to emphasize
the symmetry between the two procedures, we refer to it in our explanation of
encoding as well.) When we come to encode or decode $s[i]$,

• $Q$ stores $w_i$;

• $D$ stores each character $a$ that occurs in $w_i$, its frequency $f(a, w_i)$ there
and, if $f(a, w_i) \ge \ell/\sigma^{1/\lambda}$, its position in $A$;

• $A[\,]$ is an array of doubly-linked lists. The list $A[j]$, $0 \le j \le \lceil\log\sigma^{1/\lambda}\rceil$,
contains all characters with codeword length $j$ sorted by the codeword
offsets; we denote by $A[j].l$ the pointer to the last element in $A[j]$;

• $C[j]$ stores the number of codewords of length $j$;

• $P$ stores $C[j]/2^j$ for each $j$ and supports prefix sum queries.
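The roles of $Q$ and $D$ in the list above can be sketched in a few lines. This is an illustrative Python version only: the class name `WindowCounts` is hypothetical, a `deque` stands in for the queue and a `Counter` for the augmented dictionary, and the space bounds of the lemmas are not reproduced.

```python
from collections import Counter, deque


class WindowCounts:
    """Frequencies of the characters among the last `ell` characters pushed."""

    def __init__(self, ell):
        self.ell = ell
        self.queue = deque()   # plays the role of Q
        self.freq = Counter()  # plays the role of D (frequencies only)

    def push(self, ch):
        """Append ch to the window and evict the oldest character if needed."""
        self.queue.append(ch)
        self.freq[ch] += 1
        if len(self.queue) > self.ell:
            old = self.queue.popleft()
            self.freq[old] -= 1
            if self.freq[old] == 0:
                del self.freq[old]  # keep only characters present in the window

    def count(self, ch):
        return self.freq.get(ch, 0)
```

After each `push`, `count(a)` equals the number of occurrences of `a` in the current window, so the dictionary never holds more than `ell` entries.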

We implement $Q$ in $O(\ell\log\sigma) = O(\sigma^{1/\lambda}\log^2\sigma)$ bits of memory, $A$ in
$O(\sigma^{1/\lambda}\log^2\sigma)$ bits, and $P$ in $O(\log^2\sigma)$ bits by Lemma 4. The dictionary $D$
uses $O(\sigma^{1/\lambda+\epsilon})$ bits and supports queries and updates in $O(1)$ worst-case time
by Lemma 2; if we allow randomization, we can apply Lemma 3 and reduce the
space usage to $O(\sigma^{1/\lambda}\log^2\sigma)$ bits, but updates are supported in $O(1)$ expected
time. Therefore, altogether we use $O(\sigma^{1/\lambda+\epsilon})$ bits of memory; if randomization
is allowed, the space usage is reduced to $O(\sigma^{1/\lambda}\log^2\sigma)$ bits.

To encode $s[i]$, we first search in $D$ and, if $f(s[i], w_i) < \ell/\sigma^{1/\lambda}$, we simply
write a 0 followed by $s[i]$'s index in the alphabet, update the data structures as
described below, and proceed to $s[i+1]$; if $f(s[i], w_i) \ge \ell/\sigma^{1/\lambda}$, we use $P$ and
$s[i]$'s position $A[j, k]$ in $A$ to compute

$$\sum_{h=0}^{j-1} C[h]/2^h + (k-1)/2^j < 1.$$

The first $j = \lceil\log(\ell/f(s[i], w_i))\rceil$ bits of this sum's binary representation are
enough to uniquely identify $s[i]$ because, if a character $a \ne s[i]$ is stored at
$A[j', k']$, then

$$\left| \left( \sum_{h=0}^{j-1} C[h]/2^h + (k-1)/2^j \right) - \left( \sum_{h=0}^{j'-1} C[h]/2^h + (k'-1)/2^{j'} \right) \right| \ge 1/2^j;$$

therefore, we write a 1 followed by these bits as the codeword for $s[i]$.

To decode $s[i]$, we read the next bit in the encoding; if it is a 0, we simply
interpret the following $\lceil\log\sigma\rceil$ bits as $s[i]$'s index in the alphabet, update the
data structures, and proceed to $s[i+1]$; if it is a 1, we interpret the following
$\lceil\log\sigma^{1/\lambda}\rceil$ bits (of which $s[i]$'s codeword is a prefix) as a binary fraction $b$ and
search in $P$ for the index $j$ of the largest partial sum $\sum_{h=0}^{j-1} C[h]/2^h \le b$. Knowing
$j$ tells us the length of $s[i]$'s codeword or, equivalently, its row in $A$; we can also
compute its offset,

$$k = \left\lfloor \frac{b - \sum_{h=0}^{j-1} C[h]/2^h}{2^{-j}} \right\rfloor + 1;$$

thus, we can find and write $s[i]$.
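The encoding and decoding rules above can be exercised end to end with a deliberately naive reference sketch. It is not the data-structure-based procedure of this section: it rebuilds the canonical code from the window before every character (so it does not run in $O(\log\log\sigma)$ time per character), it breaks ties among codewords of equal length by character value rather than by list position in $A$, and it decodes by prefix matching instead of searching $P$. Characters are assumed to be integers in `range(sigma)`, and all function names are hypothetical.

```python
from collections import Counter, deque
from math import ceil, log2


def make_params(sigma, lam, c):
    """Return (window length ell, frequency threshold, bits per raw index)."""
    root = sigma ** (1.0 / lam)
    ell = ceil(c * root * log2(sigma))
    return ell, ell / root, max(1, ceil(log2(sigma)))


def current_code(freq, ell, threshold):
    """Canonical Shannon code for the characters frequent in the window."""
    lengths = {}
    for a, f in freq.items():
        if f >= threshold:
            j = 0
            while (f << j) < ell:  # j = ceil(log2(ell / f)), exactly
                j += 1
            lengths[a] = j
    code, value, prev = {}, 0, 0
    for a in sorted(lengths, key=lambda a: (lengths[a], a)):
        j = lengths[a]
        value <<= j - prev  # extend the running codeword to j bits
        code[a] = format(value, "b").zfill(j) if j else ""
        value, prev = value + 1, j
    return code


def slide(window, freq, ch, ell):
    """Append ch to the window, evicting the oldest character if it is full."""
    window.append(ch)
    freq[ch] += 1
    if len(window) > ell:
        old = window.popleft()
        freq[old] -= 1
        if freq[old] == 0:
            del freq[old]


def encode(s, sigma, lam=2.0, c=10):
    ell, threshold, index_bits = make_params(sigma, lam, c)
    window, freq, out = deque(), Counter(), []
    for ch in s:
        code = current_code(freq, ell, threshold)
        if ch in code:
            out.append("1" + code[ch])  # frequent: flag 1 + codeword
        else:
            out.append("0" + format(ch, "b").zfill(index_bits))  # raw index
        slide(window, freq, ch, ell)
    return "".join(out)


def decode(bits, n, sigma, lam=2.0, c=10):
    ell, threshold, index_bits = make_params(sigma, lam, c)
    window, freq, out, pos = deque(), Counter(), [], 0
    for _ in range(n):
        flag, pos = bits[pos], pos + 1
        if flag == "0":
            ch = int(bits[pos:pos + index_bits], 2)
            pos += index_bits
        else:
            code = current_code(freq, ell, threshold)
            inverse = {w: a for a, w in code.items()}
            end = pos
            while bits[pos:end] not in inverse:  # code is prefix-free
                end += 1
                if end > len(bits):
                    raise ValueError("truncated input")
            ch, pos = inverse[bits[pos:end]], end
        out.append(ch)
        slide(window, freq, ch, ell)
    return out
```

Because the decoder rebuilds exactly the same window before each character, it reconstructs the same code as the encoder, which is what makes the scheme adaptive without transmitting any code table.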

Encoding or decoding $s[i]$ takes $O(1)$ time for querying $D$ and $A$ and, if
$f(s[i], w_i) \ge \ell/\sigma^{1/\lambda}$, then

$$O\left(\log\log\frac{\ell}{f(s[i], w_i)}\right) = O(\log\log\sigma)$$

time to query $P$. After encoding or decoding $s[i]$, we update the data structures
as follows:



• we dequeue $s[i-\ell]$ (if it exists) from $Q$ and enqueue $s[i]$; we decrement
$s[i-\ell]$'s frequency in $D$ and delete it if it does not occur in $w_{i+1}$; insert
$s[i]$ into $D$ if it does not occur in $w_i$ or, if it does, increment its frequency;

• we remove $s[i-\ell]$ from $A$ (by replacing it with the last character in its
list $A[j]$, decrementing $C[j]$, and updating $D$) if

$$f(s[i-\ell], w_{i+1}) < \ell/\sigma^{1/\lambda} \le f(s[i-\ell], w_i);$$

• we move $s[i-\ell]$ from list $A[j]$ to list $A[j+1]$ if

$$\lceil\log\sigma^{1/\lambda}\rceil \ge \left\lceil\log\frac{\ell}{f(s[i-\ell], w_{i+1})}\right\rceil > \left\lceil\log\frac{\ell}{f(s[i-\ell], w_i)}\right\rceil;$$

this is done by replacing $s[i-\ell]$ with the last element of $A[j]$ and appending $s[i-\ell]$ at
the end of $A[j+1]$; pointers $A[j].l$ and $A[j+1].l$ and counters $C[j]$ and
$C[j+1]$ are also updated;

• if necessary, we insert $s[i]$ into $A$ or move it from $A[j]$ to $A[j-1]$; these
procedures are symmetric to deleting $s[i-\ell]$ and to moving $s[i-\ell]$ from
$A[j]$ to $A[j+1]$;

• finally, if we have changed $C$, the data structure $P$ is updated.

All of these updates, except the last one, take $O(1)$ time, and updating $P$ takes
$O(\log\log\sigma)$ time in the worst case. When we insert a new element $s[i]$ into
$Q$, this may lead to updating $P$ as described above. We may decrement the
codeword length of $s[i]$ or insert a new codeword for the symbol $s[i]$. In both cases,
by Lemma 4 we can update $P$ in $O(\log \mathrm{length}(s[i]))$ time, where $\mathrm{length}(s[i])$ is the current
codeword length of $s[i]$. When we delete an element $s[i-\ell]$, we may increment
the codeword length of $s[i-\ell]$ or remove it from the code. If the codeword
length is incremented, then we update $P$ in $O(\log \mathrm{length}(s[i-\ell]))$ time. If we
remove the codeword for $s[i-\ell]$, then we also update $P$ in $O(\log \mathrm{length}(s[i-\ell]))$
time; in the last case we can charge the cost of updating $P$ to the previous
occurrence of $s[i-\ell]$ in the string $s$, when $s[i-\ell]$ was encoded with $\mathrm{length}(s[i-\ell])$
bits. Since codeword lengths are logarithmic in the ratio of $\ell$ to the window frequency,
the update times for symbols $s[i]$ and $s[i-\ell]$ are $O\left(\log\log\frac{\ell}{f(s[i], w_i)}\right)$
and $O\left(\log\log\frac{\ell}{f(s[i-\ell], w_i)}\right)$ respectively. Hence, by Jensen's inequality, in total
we encode $s$ in $O(n\log H')$ time, where $H'$ is the average number of bits per
character in our encoding. In the next section, we will prove that the
sliding-window Shannon coding encodes $s$ in $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$
bits. Since we can assume that $\sigma$ is not vastly larger than $n$, our method works
in $O(n\log H)$ time.

If the dictionary $D$ is implemented as in Lemma 3, the analysis is exactly
the same, but a string $s$ is processed in expected time $O(n\log H)$.

Lemma 5 Sliding-window Shannon coding can be implemented in $O(n\log H)$
time overall and $O(\log\log\sigma)$ time for any character, $O(\sigma^{1/\lambda+\epsilon})$ bits of memory
and one pass. If randomization is allowed, sliding-window Shannon coding can
be implemented in $O(\sigma^{1/\lambda}\log^2\sigma)$ bits of memory and $O(n\log H)$ expected time.



4 Analysis 

In this section we prove the upper bound on the encoding length of
sliding-window Shannon coding and obtain the following theorem.

Theorem 1 We encode $s$ in, and later decode it from, $\lambda n H(s) + (\lambda \ln 2 + 2 +
\epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$ bits using $O(n\log H)$ time overall and $O(\log\log\sigma)$ time
for any character, $O(\sigma^{1/\lambda+\epsilon})$ bits of memory and one pass. If randomization is
allowed, the memory usage can be reduced to $O(\sigma^{1/\lambda}\log^2\sigma)$ bits and $s$ can be
encoded and decoded in $O(n\log H)$ expected time.

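Proof: Consider any substring $s' = s[k..(k+\ell-1)]$ of $s$ with length $\ell$, and let
$F$ be the set of characters $a$ such that

$$f(a, s[\max(k-\ell, 1)..(k+\ell-1)]) \ge \frac{\ell}{\sigma^{1/\lambda}};$$

notice $|F| \le 2\sigma^{1/\lambda}$. For $k \le i \le k+\ell-1$, if $s[i] \in F$ but $f(s[i], w_i) < \ell/\sigma^{1/\lambda}$,
then we encode $s[i]$ using

$$\lceil\log\sigma\rceil + 1 < \lambda\log\sigma^{1/\lambda} + 2 < \lambda\log\frac{\ell}{\max(f(s[i], w_i), 1)} + 2 \le \lambda\log\frac{\ell}{\max(f(s[i], s[k..(i-1)]), 1)} + 2$$

bits; if $f(s[i], w_i) \ge \ell/\sigma^{1/\lambda}$, then we encode $s[i]$ using

$$\left\lceil\log\frac{\ell}{f(s[i], w_i)}\right\rceil + 1 < \lambda\log\frac{\ell}{\max(f(s[i], s[k..(i-1)]), 1)} + 2$$

bits; finally, if $s[i] \notin F$, then we again encode $s[i]$ using

$$\lceil\log\sigma\rceil + 1 < \lambda\log\sigma^{1/\lambda} + 2 < \lambda\log\frac{\ell}{f(s[i], s[\max(k-\ell, 1)..(k+\ell-1)])} + 2$$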
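bits. Therefore, the total number of bits we use to encode $s'$ is less than

$$\lambda \sum_{\substack{s[i] \in F, \\ k \le i \le k+\ell-1}} \log\frac{\ell}{\max(f(s[i], s[k..(i-1)]), 1)} + \lambda \sum_{a \notin F} f(a, s')\log\frac{\ell}{f(a, s')} + 2\ell$$

$$= \lambda\ell\log\ell - \lambda \sum_{a \in F} \sum_{\substack{s[i] = a, \\ k \le i \le k+\ell-1}} \log\left(\max(f(a, s[k..(i-1)]), 1)\right) - \lambda \sum_{a \notin F} f(a, s')\log f(a, s') + 2\ell;$$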

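since

$$\sum_{\substack{s[i] = a, \\ k \le i \le k+\ell-1}} \log\max(f(a, s[k..(i-1)]), 1) = \sum_{j=1}^{f(a, s')-1} \log j,$$

we can rewrite our bound as

$$\lambda\left( \ell\log\ell - \sum_{a \in F} \sum_{j=1}^{f(a, s')-1} \log j - \sum_{a \notin F} f(a, s')\log f(a, s') \right) + 2\ell$$

$$= \lambda\left( \ell\log\ell - \sum_{a \in F} \log\left((f(a, s') - 1)!\right) - \sum_{a \notin F} f(a, s')\log f(a, s') \right) + 2\ell;$$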

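by Stirling's Formula,

$$\ell\log\ell - \sum_{a \in F} \log\left((f(a, s') - 1)!\right)$$

$$= \ell\log\ell - \sum_{a \in F} \log\left(f(a, s')!\right) + \sum_{a \in F} \log f(a, s')$$

$$\le \ell\log\ell - \sum_{a \in F} \left( f(a, s')\log f(a, s') - f(a, s')\ln 2 \right) + |F|\log\ell$$

$$\le \ell\log\ell - \sum_{a \in F} f(a, s')\log f(a, s') + \ell\ln 2 + 2\sigma^{1/\lambda}\log\ell,$$

so we can again rewrite our bound as

$$\lambda\left( \ell\log\ell - \sum_{a} f(a, s')\log f(a, s') + \ell\ln 2 + 2\sigma^{1/\lambda}\log\ell \right) + 2\ell$$

$$= \lambda\ell H(s') + \left( \lambda\ln 2 + 2 + \frac{2\lambda\sigma^{1/\lambda}\log\ell}{\ell} \right)\ell.$$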



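Recall $\ell = \lceil c\sigma^{1/\lambda}\log\sigma \rceil$, so

$$\frac{2\lambda\sigma^{1/\lambda}\log\ell}{\ell} = \frac{2\lambda\sigma^{1/\lambda}\log\lceil c\sigma^{1/\lambda}\log\sigma \rceil}{\lceil c\sigma^{1/\lambda}\log\sigma \rceil} \le \frac{2\lambda\left( \log c + (1/\lambda)\log\sigma + \log\log\sigma + 1 \right)}{c\log\sigma} \le \frac{2\lambda(\log c + 3)}{c}$$

(we will give tighter inequalities in the full paper, but use these here for
simplicity); for any constants $\lambda > 1$ and $\epsilon > 0$, we can choose a constant $c$ large
enough that

$$\frac{2\lambda(\log c + 3)}{c} < \epsilon,$$

so the number of bits we use to encode $s'$ is less than $\lambda\ell H(s') + (\lambda \ln 2 + 2 + \epsilon)\ell$.
With $c = 10$, for example,

$$\frac{2\lambda(\log c + 3)}{c} < (2 - \ln 2)\lambda,$$

so our bound is less than $\lambda\ell H(s') + (2\lambda + 2)\ell$; with $c = 100$, it is less than
$\lambda\ell H(s') + (0.9\lambda + 2)\ell$.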
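The two numeric examples for $c = 10$ and $c = 100$ can be checked with a few lines of arithmetic. The sketch below is only a sanity check of the constants: the helper name `slack` is hypothetical, logarithms are taken base 2 as elsewhere in the paper, and the values are reported per unit of $\lambda$.

```python
from math import log, log2


def slack(c, lam=1.0):
    """Return 2 * lam * (log2(c) + 3) / c, the extra per-character term."""
    return 2 * lam * (log2(c) + 3) / c


for c in (10, 100, 1000):
    # Coefficient of lambda in the per-character overhead: ln 2 + slack(c).
    print(c, round(slack(c), 4), round(log(2) + slack(c), 4))
```

The printed coefficient stays below 2 for $c = 10$ and below 0.9 for $c = 100$, and it keeps shrinking towards $\ln 2$ as $c$ grows.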

Since the product of length and empirical entropy is superadditive, i.e.,
$|s_1|H(s_1) + |s_2|H(s_2) \le |s_1 s_2|H(s_1 s_2)$, we have

$$\ell \sum_{j=0}^{\lfloor n/\ell \rfloor - 1} H(s[(j\ell + 1)..(j+1)\ell]) \le nH(s),$$

so, by the bound above, we encode the first $\ell\lfloor n/\ell \rfloor$ characters of $s$ using fewer
than $\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n$ bits. We encode the last $\ell$ characters of $s$ using
fewer than

$$\lambda\ell H(s[(n-\ell+1)..n]) + (\lambda \ln 2 + 2 + \epsilon)\ell = O(\ell\log\sigma) = O(\sigma^{1/\lambda}\log^2\sigma)$$

bits so, even counting the bits we use for $s[(n-\ell+1)..\ell\lfloor n/\ell \rfloor]$ twice, in total
we encode $s$ using fewer than

$$\lambda n H(s) + (\lambda \ln 2 + 2 + \epsilon)n + O(\sigma^{1/\lambda}\log^2\sigma)$$

bits. □
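The superadditivity step used in the proof can be checked numerically on small inputs. The following sketch is an illustration rather than part of the argument; the function names are hypothetical, and the entropy is the zeroth-order empirical entropy defined in the introduction.

```python
from collections import Counter
from math import log2


def entropy(s):
    """Zeroth-order empirical entropy of s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((f / n) * log2(n / f) for f in Counter(s).values())


def blockwise_entropy_sum(s, ell):
    """Sum of ell * H(block) over the floor(n / ell) full blocks of s."""
    full = len(s) // ell
    return sum(ell * entropy(s[j * ell:(j + 1) * ell]) for j in range(full))
```

For any string and block length, `blockwise_entropy_sum(s, ell)` should not exceed `len(s) * entropy(s)` (up to floating-point rounding), which is the inequality the proof relies on when it sums the per-block bounds.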

If the most common $\sigma^{1/\lambda}$ characters in the alphabet make up much more
than half of $s$ (in particular, when $\lambda = 1$) then, instead of using an extra bit
for each character, we can keep a special escape codeword and use it to indicate
occurrences of characters not in the code. The analysis becomes somewhat
complicated, however, so we leave discussion of this modification for the full
paper.
5 Summary



In this paper we presented an algorithm that uses space sub-linear in the
alphabet size and achieves an encoding length that is close to the lower bound of [5].
Our algorithm processes each symbol in $O(\log\log\sigma)$ worst-case time, whereas
linear-space prefix coding algorithms can encode a string of $n$ symbols in $O(n)$
time, i.e., in time independent of the alphabet size $\sigma$. It is an interesting open
problem whether our algorithm (or one with the same space bound) can be
made to run in $O(n)$ time.